House Sales Prediction in King County, USA

Introduction

  • Kaggle의 오픈 데이터에서 강의용 데이터로 쓸만한 회귀분석용 데이터를 찾던 중 이 데이터를 찾게 되었다. 링크

  • 이 데이터셋은 2014년 5월부터 2015년 5월까지 매매된 King County의 집값 데이터를 포함하고 있다.

  • 이 데이터를 활용해서 강의에서 전달하고자 하는 바는 Feature EngineeringData Exploration의 힘이다. 정말 간단한 몇 개의 과정을 반복하는 것만으로도 예측 모델의 성능을 어디까지 끌어올릴 수 있는지를 보여주고자 했다.

  • 강의를 위해서 준비한 자료를 공유하는 것이기 때문에, 자세한 코멘트는 달지 않았다. 추후 시간을 더 들여서 자세한 설명을 담은 포스트를 작성하고자 한다.

Tying Shoes

### Load the libraries
library(lubridate)

Attaching package: 'lubridate'
The following object is masked from 'package:base':

    date
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)

Attaching package: 'GGally'
The following object is masked from 'package:dplyr':

    nasa
library(corrplot)
library(ggmap)
### Load the dataset
House <- read_csv("data/king_county/kc_house_data.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  id = col_character(),
  date = col_datetime(format = ""),
  price = col_double(),
  bathrooms = col_double(),
  floors = col_double(),
  lat = col_double(),
  long = col_double()
)
See spec(...) for full column specifications.
head(House)
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
7129300520 2014-10-13 221900 3 1.00 1180 5650 1 0 0 3 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
6414100192 2014-12-09 538000 3 2.25 2570 7242 2 0 0 3 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
5631500400 2015-02-25 180000 2 1.00 770 10000 1 0 0 3 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
2487200875 2014-12-09 604000 4 3.00 1960 5000 1 0 0 5 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
1954400510 2015-02-18 510000 3 2.00 1680 8080 1 0 0 3 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503
7237550310 2014-05-12 1225000 4 4.50 5420 101930 1 0 0 3 11 3890 1530 2001 0 98053 47.6561 -122.005 4760 101930

Visualizing (Heat Map)

### Initialize a map for King County
kingCounty <- get_map(location = 'issaquah',
                      zoom = 9,
                      maptype = "roadmap"
)
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=issaquah&zoom=9&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=issaquah&sensor=false
### Generate a heat map
ggmap(kingCounty) + 
    geom_density2d(data = House, aes(x = long, y = lat), size = .3) + 
    stat_density2d(data = House, aes(x = long, y = lat, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") + 
    scale_fill_gradient(low = "green", high = "red") + 
    scale_alpha(range = c(0.2, 0.4), guide = FALSE)

  • 대부분의 데이터가 시애틀 지역을 기반에 두고 있으며, 외곽의 버클리나 스노퀄미 등의 지역의 데이터도 포함되어 있다.

Data Preparation

House %>%
    mutate(
        sale_year = year(date),
        sale_month = month(date)
    ) %>%
    select(-id, -date) -> House
set.seed(2017)
trainIdx <- sample(1:nrow(House), size = 0.7 * nrow(House))
train <- House[trainIdx, ]
test <- House[-trainIdx, ]

Benchmark

bench_model <- lm(price ~ ., data = train)
summary(bench_model)

Call:
lm(formula = price ~ ., data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1289272   -98433    -9562    76172  4400147 

Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -7.428e+07  1.179e+07  -6.301 3.03e-10 ***
bedrooms      -3.369e+04  2.203e+03 -15.296  < 2e-16 ***
bathrooms      4.165e+04  3.863e+03  10.782  < 2e-16 ***
sqft_living    1.444e+02  5.198e+00  27.778  < 2e-16 ***
sqft_lot       1.202e-01  5.446e-02   2.207 0.027360 *  
floors         5.802e+03  4.248e+03   1.366 0.172069    
waterfront     5.701e+05  2.039e+04  27.952  < 2e-16 ***
view           5.602e+04  2.482e+03  22.572  < 2e-16 ***
condition      2.925e+04  2.802e+03  10.439  < 2e-16 ***
grade          9.494e+04  2.551e+03  37.217  < 2e-16 ***
sqft_above     2.731e+01  5.154e+00   5.299 1.18e-07 ***
sqft_basement         NA         NA      NA       NA    
yr_built      -2.489e+03  8.579e+01 -29.018  < 2e-16 ***
yr_renovated   2.151e+01  4.304e+00   4.997 5.89e-07 ***
zipcode       -5.576e+02  3.909e+01 -14.265  < 2e-16 ***
lat            6.133e+05  1.268e+04  48.360  < 2e-16 ***
long          -2.092e+05  1.559e+04 -13.422  < 2e-16 ***
sqft_living15  2.897e+01  4.061e+00   7.134 1.02e-12 ***
sqft_lot15    -2.804e-01  8.390e-02  -3.342 0.000833 ***
sale_year      3.893e+04  5.581e+03   6.976 3.16e-12 ***
sale_month     1.914e+03  8.338e+02   2.296 0.021713 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 198800 on 15109 degrees of freedom
Multiple R-squared:  0.7017,    Adjusted R-squared:  0.7013 
F-statistic:  1870 on 19 and 15109 DF,  p-value: < 2.2e-16
benchmark <- predict(bench_model, test)
benchmark <- ifelse(benchmark < 0, 0, benchmark)
  • 벤치마크 모델의 경우 굉장히 나쁜 성능을 보일 것은 자명하다. 밑의 과정들을 통해서 성능을 개선해보자.

Data Exploration

### Generate a heat map
ggmap(kingCounty) + 
    geom_point(data = train, aes(x = long, y = lat, color = log(price), alpha = log(price))) + 
    scale_color_gradient(low = "green", high = "red")

  • price를 로그화하여 시각화한 결과로, 남부(lat < 47.5)보다 북부(lat >= 47.5) 쪽의 가격이 더 높음을 알 수 있다.
  • 그 중에서도 해변가에 인접한 곳의 가격이 더 높다.

Correlation

cor_House <- cor(House[, -1])
corrplot(cor_House, order = "hclust")

Boxplots

Grade - Price
train %>%
    mutate(grade = factor(grade)) %>%
    ggplot(aes(x = grade, y = price, fill = grade)) +
    geom_boxplot() + 
    geom_point(
        data = train %>% 
            group_by(grade) %>%
            summarise(median = median(price)) %>%
            mutate(grade = factor(grade)),
        aes(x = grade, y = median, group = 1),
        size = 5, stroke = 2,
        color = "black", fill = "white", shape = 23
    )

  • grade가 한 단계 높아질 때마다 가격이 기하급수적으로 증가하는 것으로 보인다. 확인을 위해서 log(price)에 대해서 박스플롯을 그려본다.
train %>%
    mutate(grade = factor(grade)) %>%
    ggplot(aes(x = grade, y = log(price), fill = grade)) +
    geom_boxplot() + 
    geom_point(
        data = train %>% 
            group_by(grade) %>%
            summarise(median = median(log(price))) %>%
            mutate(grade = factor(grade)),
        aes(x = grade, y = median, group = 1),
        size = 5, stroke = 2,
        color = "black", fill = "white", shape = 23
    )

Year Build - Price
train %>%
    mutate(yr_cat = cut(yr_built, breaks = seq(1900, 2020, by = 10),
                        labels = paste0(seq(1900, 2010, by = 10), "s"))) %>%
    ggplot(aes(x = yr_cat, y = log(price), fill = yr_cat)) + 
    geom_boxplot()

  • 건물이 지어진 연대와 가격 사이에는 큰 인사이트를 얻기 힘들어 보인다.
Year Renovated - Price
train %>%
    filter(yr_renovated != 0) %>%
    mutate(renovated_cat = cut(yr_renovated, breaks = seq(1930, 2020, by = 10),
                        labels = paste0(seq(1930, 2010, by = 10), "s"))) %>%
    ggplot(aes(x = renovated_cat, y = log(price), fill = renovated_cat)) + 
    geom_boxplot()

  • 집을 개조한 경우, 최근에 개조할 수록 가격이 조금이라도 증가하는 경향을 보인다.
Is there any difference between renovated / non-renovated
train %>%
    mutate(isRenovated = factor(ifelse(yr_renovated != 0, 1, 0))) %>%
    ggplot(aes(x = isRenovated, y = log(price), fill = isRenovated)) + 
    geom_boxplot()

  • 개조한 집의 가격이 대체로 비싸게 책정됨을 알 수 있다.
Year Saled - Price / Month Saled - Price
train %>%
    mutate(sale_year = factor(sale_year)) %>%
    ggplot(aes(x = sale_year, y = log(price), fill = sale_year)) + 
    geom_boxplot()

train %>%
    mutate(sale_month = factor(sale_month)) %>%
    ggplot(aes(x = sale_month, y = log(price), fill = sale_month)) + 
    geom_boxplot()

  • 두 변수 모두 가격에 영향을 미치는 것으로 보이진 않는다.
Bathrooms - Price
train %>%
    mutate(bathrooms = factor(bathrooms)) %>%
    ggplot(aes(x = bathrooms, y = log(price), fill = bathrooms)) + 
    geom_boxplot()

  • log(price)bathrooms는 유사 선형관계를 가진다.
Coordinate - Price
train %>%
    ggplot(aes(x = lat, y = log(price), color = lat)) + 
    geom_line() + geom_point(shape = 21)

train %>%
    ggplot(aes(x = long, y = log(price), color = long)) + 
    geom_line() + geom_point(shape = 21)

  • 위도와 경도 모두 특정 영역에서 높은 가격대가 형성이 되어 있다. 변수를 새로 생성해서 영역을 분리하는 것이 도움이 될 것으로 보인다.
    • Latitude : ~47.5 / 47.5 ~ 47.6 / 47.6 ~
Zip Code - Price
sort(unique(train$zipcode)) == sort(unique(test$zipcode))
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  • 트레이닝 데이터와 테스트 데이터에 존재하는 Zip Code는 동일하다.
train %>%
    arrange(zipcode) %>%
    mutate(zipcode = factor(zipcode)) %>%
    ggplot(aes(x = zipcode, y = log(price), fill = zipcode)) + 
    geom_boxplot()

  • one-hot encoding 으로 데이터를 확장하는 것을 고려하자.

Feature Engineering

Split the latitude
splitLat <- function(data){
    data <- data %>%
        dplyr::mutate(lat1 = ifelse(lat <= 47.5, lat, 0),
                      lat2 = ifelse(lat > 47.5 & lat <= 47.6, lat, 0),
                      lat3 = ifelse(lat > 47.6, lat, 0)) %>%
        dplyr::select(-lat)
    return(data)
}

train <- splitLat(train)
test <- splitLat(test)
Is this house renovated?
train <- train %>%
    mutate(isRenovated = ifelse(yr_renovated != 0, 1, 0))

test <- test %>%
    mutate(isRenovated = ifelse(yr_renovated != 0, 1, 0))
How old is this house?
train <- train %>%
    mutate(age = ifelse(yr_renovated != 0, 2016 - yr_renovated, 2016 - yr_built))

test <- test %>%
    mutate(age = ifelse(yr_renovated != 0, 2016 - yr_renovated, 2016 - yr_built))
Zip Code (one-hot encoding)
train$zipcode <- factor(train$zipcode)
test$zipcode <- factor(test$zipcode)
zipcode_train <- data.frame(model.matrix(price ~ 0 + zipcode, data = train))
zipcode_test <- data.frame(model.matrix(price ~ 0 + zipcode, data = test))
train <- train %>%
    select(-zipcode) %>%
    cbind(zipcode_train)

test <- test %>%
    select(-zipcode) %>%
    cbind(zipcode_test)

Modeling

model <- lm(log(price) ~ . - age, data = train)
summary(model)

Call:
lm(formula = log(price) ~ . - age, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.34416 -0.09529  0.00901  0.10456  1.00032 

Coefficients: (2 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.714e+02  1.346e+01 -12.730  < 2e-16 ***
bedrooms       3.479e-03  2.059e-03   1.689 0.091173 .  
bathrooms      3.477e-02  3.599e-03   9.661  < 2e-16 ***
sqft_living    1.293e-04  4.842e-06  26.704  < 2e-16 ***
sqft_lot       5.834e-07  5.052e-08  11.547  < 2e-16 ***
floors        -2.649e-02  4.293e-03  -6.170 6.99e-10 ***
waterfront     4.806e-01  1.919e-02  25.050  < 2e-16 ***
view           5.914e-02  2.350e-03  25.163  < 2e-16 ***
condition      6.474e-02  2.656e-03  24.372  < 2e-16 ***
grade          8.779e-02  2.486e-03  35.317  < 2e-16 ***
sqft_above     7.064e-05  4.956e-06  14.255  < 2e-16 ***
sqft_basement         NA         NA      NA       NA    
yr_built      -1.922e-04  8.836e-05  -2.175 0.029630 *  
yr_renovated   3.774e-03  4.898e-04   7.706 1.38e-14 ***
long          -2.737e-01  6.348e-02  -4.312 1.63e-05 ***
sqft_living15  8.924e-05  3.933e-06  22.692  < 2e-16 ***
sqft_lot15     2.001e-07  7.989e-08   2.505 0.012267 *  
sale_year      6.783e-02  5.156e-03  13.157  < 2e-16 ***
sale_month     3.005e-03  7.699e-04   3.904 9.52e-05 ***
lat1           2.803e-01  9.154e-02   3.062 0.002203 ** 
lat2           2.827e-01  9.148e-02   3.090 0.002005 ** 
lat3           2.830e-01  9.142e-02   3.096 0.001967 ** 
isRenovated   -7.437e+00  9.778e-01  -7.606 2.99e-14 ***
zipcode98001  -6.005e-01  3.569e-02 -16.826  < 2e-16 ***
zipcode98002  -6.169e-01  3.805e-02 -16.212  < 2e-16 ***
zipcode98003  -5.981e-01  3.546e-02 -16.869  < 2e-16 ***
zipcode98004   2.899e-01  2.205e-02  13.144  < 2e-16 ***
zipcode98005  -4.905e-02  2.638e-02  -1.859 0.062979 .  
zipcode98006  -1.368e-01  2.759e-02  -4.957 7.23e-07 ***
zipcode98007  -1.256e-01  2.798e-02  -4.491 7.14e-06 ***
zipcode98008  -1.272e-01  2.610e-02  -4.872 1.11e-06 ***
zipcode98010  -2.845e-01  4.438e-02  -6.410 1.50e-10 ***
zipcode98011  -3.850e-01  2.631e-02 -14.635  < 2e-16 ***
zipcode98014  -3.793e-01  4.221e-02  -8.985  < 2e-16 ***
zipcode98019  -4.399e-01  3.589e-02 -12.257  < 2e-16 ***
zipcode98022  -4.542e-01  4.833e-02  -9.398  < 2e-16 ***
zipcode98023  -6.624e-01  3.506e-02 -18.894  < 2e-16 ***
zipcode98024  -2.737e-01  4.448e-02  -6.154 7.75e-10 ***
zipcode98027  -1.882e-01  3.233e-02  -5.819 6.03e-09 ***
zipcode98028  -4.215e-01  2.326e-02 -18.120  < 2e-16 ***
zipcode98029  -1.143e-01  3.417e-02  -3.345 0.000825 ***
zipcode98030  -5.425e-01  3.469e-02 -15.638  < 2e-16 ***
zipcode98031  -5.334e-01  3.254e-02 -16.393  < 2e-16 ***
zipcode98032  -6.521e-01  3.605e-02 -18.087  < 2e-16 ***
zipcode98033  -3.732e-02  2.174e-02  -1.717 0.086068 .  
zipcode98034  -2.916e-01  2.130e-02 -13.687  < 2e-16 ***
zipcode98038  -3.838e-01  3.739e-02 -10.265  < 2e-16 ***
zipcode98039   4.156e-01  3.461e-02  12.009  < 2e-16 ***
zipcode98040   7.704e-02  2.650e-02   2.907 0.003657 ** 
zipcode98042  -5.180e-01  3.494e-02 -14.824  < 2e-16 ***
zipcode98045  -1.803e-01  4.814e-02  -3.745 0.000181 ***
zipcode98052  -1.571e-01  2.412e-02  -6.513 7.62e-11 ***
zipcode98053  -1.793e-01  2.955e-02  -6.067 1.34e-09 ***
zipcode98055  -4.773e-01  3.013e-02 -15.839  < 2e-16 ***
zipcode98056  -4.082e-01  2.752e-02 -14.830  < 2e-16 ***
zipcode98058  -4.547e-01  3.093e-02 -14.702  < 2e-16 ***
zipcode98059  -3.241e-01  2.930e-02 -11.062  < 2e-16 ***
zipcode98065  -2.801e-01  4.154e-02  -6.742 1.62e-11 ***
zipcode98070  -3.817e-01  3.460e-02 -11.033  < 2e-16 ***
zipcode98072  -3.325e-01  2.740e-02 -12.135  < 2e-16 ***
zipcode98074  -2.092e-01  2.808e-02  -7.449 9.93e-14 ***
zipcode98075  -1.881e-01  3.287e-02  -5.723 1.07e-08 ***
zipcode98077  -3.620e-01  3.144e-02 -11.514  < 2e-16 ***
zipcode98092  -5.397e-01  3.801e-02 -14.199  < 2e-16 ***
zipcode98102   1.037e-01  2.557e-02   4.055 5.04e-05 ***
zipcode98103  -4.703e-02  1.643e-02  -2.863 0.004201 ** 
zipcode98105   8.999e-02  2.070e-02   4.348 1.38e-05 ***
zipcode98106  -4.844e-01  2.412e-02 -20.081  < 2e-16 ***
zipcode98107  -3.260e-02  1.897e-02  -1.718 0.085781 .  
zipcode98108  -4.541e-01  2.633e-02 -17.243  < 2e-16 ***
zipcode98109   1.566e-01  2.431e-02   6.442 1.22e-10 ***
zipcode98112   2.175e-01  1.997e-02  10.893  < 2e-16 ***
zipcode98115  -3.645e-02  1.718e-02  -2.122 0.033879 *  
zipcode98116  -6.941e-02  2.326e-02  -2.985 0.002844 ** 
zipcode98117  -6.218e-02  1.637e-02  -3.799 0.000146 ***
zipcode98118  -3.441e-01  2.410e-02 -14.274  < 2e-16 ***
zipcode98119   1.244e-01  2.082e-02   5.975 2.35e-09 ***
zipcode98122  -3.109e-02  1.934e-02  -1.608 0.107939    
zipcode98125  -2.822e-01  1.917e-02 -14.719  < 2e-16 ***
zipcode98126  -2.726e-01  2.384e-02 -11.438  < 2e-16 ***
zipcode98133  -4.177e-01  1.852e-02 -22.556  < 2e-16 ***
zipcode98136  -1.398e-01  2.514e-02  -5.559 2.76e-08 ***
zipcode98144  -1.551e-01  2.358e-02  -6.578 4.92e-11 ***
zipcode98146  -4.871e-01  2.546e-02 -19.131  < 2e-16 ***
zipcode98148  -4.666e-01  3.982e-02 -11.717  < 2e-16 ***
zipcode98155  -4.417e-01  2.033e-02 -21.720  < 2e-16 ***
zipcode98166  -3.702e-01  2.870e-02 -12.898  < 2e-16 ***
zipcode98168  -6.160e-01  2.705e-02 -22.774  < 2e-16 ***
zipcode98177  -2.814e-01  2.055e-02 -13.693  < 2e-16 ***
zipcode98178  -5.612e-01  2.746e-02 -20.436  < 2e-16 ***
zipcode98188  -5.367e-01  3.216e-02 -16.688  < 2e-16 ***
zipcode98198  -5.940e-01  3.122e-02 -19.028  < 2e-16 ***
zipcode98199          NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1832 on 15038 degrees of freedom
Multiple R-squared:  0.8798,    Adjusted R-squared:  0.8791 
F-statistic:  1223 on 90 and 15038 DF,  p-value: < 2.2e-16
Evaluation Metric: RMSLE

평가 메트릭으로 Root Mean Squared Logarithmic Error(RMSLE)를 사용한다. 해당 메트릭은 과대평가된 항목보다는 과소평가된 항목에 페널티를 준다.

\[ RMSLE = \sqrt{\frac{1}{n} \sum^n_{i=1} \left( \log(p_i + 1) - \log(a_i + 1)\right)^2} \]

rmsle <- function(predict, actual){
    if(length(predict) != length(actual))
        stop("The length of two vectors are different.")
    
    len <- length(predict)
    rmsle <- sqrt((1/len) * sum((log(predict + 1) - log(actual + 1))^2))
    return(rmsle)
}
Test
pred <- predict(model, test)
pred <- exp(pred)

result <- rmsle(pred, test$price)
benchmark_result <- rmsle(benchmark, test$price)
cat("RMSLE (Benchmark): ", benchmark_result, "\nRMSLE (Final): ", result)
RMSLE (Benchmark):  0.9280684 
RMSLE (Final):  0.1794783
  • 데이터에 아무런 조작을 하지 않고, 로그 스케일을 고려하지 않은 벤치마크 모델의 RMSLE 값은 0.9280684이다. 로그 스케일과 피쳐 엔지니어링을 이용한 모델의 예측력은 다섯 배 가량 개선되었다. 최종 모델의 RMSLE는 0.1794783이다.